SPARQL query processing with Apache Spark

نویسندگان

  • Hubert Naacke
  • Olivier Curé
  • Bernd Amann
چکیده

The number and the size of linked open data graphs keep growing at a fast pace and confronts semantic RDF services with problems characterized as Big data. Distributed query processing is one of them and needs to be efficiently addressed with execution guaranteeing scalability, high availability and fault tolerance. RDF data management systems requiring these properties are rarely built from scratch but are rather designed on top of an existing engine. In this work, we consider the processing of SPARQL queries with the current state of the art cluster computing engine, namely Apache Spark. We propose and compare five different query processing approaches based on different join execution models and Spark components. A detailed experimentation on real-world and synthetic data sets promotes two new approaches tailored for the RDF data model which outperform (by a factor of up to 2.4 on query execution time compared to a state of the art distributed SPARQL processing engine) the other ones on all major query shapes, i.e., star, snowflake, chain and their composition.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

PRoST: Distributed Execution of SPARQL Queries Using Mixed Partitioning Strategies

The rapidly growing size of RDF graphs in recent years necessitates distributed storage and parallel processing strategies. To obtain efficient query processing using computer clusters a wide variety of different approaches have been proposed. Related to the approach presented in the current paper are systems built on top of Hadoop HDFS, for example using Apache Accumulo or using Apache Spark. ...

متن کامل

Towards a distributed, scalable and real-time RDF Stream Processing engine

Due to the growing need to timely process and derive valuable information and knowledge from data produced in the Semantic Web, RDF stream processing (RSP) has emerged as an important research domain. Of course, modern RSP have to address the volume and velocity characteristics encountered in the Big Data era. This comes at the price of designing high throughput, low latency, fault tolerant, hi...

متن کامل

S2RDF: RDF Querying with SPARQL on Spark

RDF has become very popular for semantic data publishing due to its flexible and universal graph-like data model. Thus, the ever-increasing size of RDF data collections raises the need for scalable distributed approaches. We endorse the usage of existing infrastructures for Big Data processing like Hadoop for this purpose. Yet, SPARQL query performance is a major challenge as Hadoop is not inte...

متن کامل

HAQWA: a Hash-based and Query Workload Aware Distributed RDF Store

Like most data models encountered in the Big Data ecosystem, RDF stores are managing large data sets by partitioning triples across a cluster of machines. Nevertheless, the graphical nature of RDF data as well as its associated SPARQL query execution model makes the efficient data distribution more involved than in other data models, e.g., relational. In this paper, we propose a novel system th...

متن کامل

From SPARQL to MapReduce: The Journey Using a Nested TripleGroup Algebra

MapReduce-based data processing platforms offer a promising approach for cost-effective and Web-scale processing of Semantic Web data. However, one major challenge is that this computational paradigm leads to high I/O and communication costs when processing tasks with several join operations typical in SPARQL queries. The goal of this demonstration is to show how a system RAPID+, an extension o...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • CoRR

دوره abs/1604.08903  شماره 

صفحات  -

تاریخ انتشار 2016